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Abstract 


The genomic evolution inherent to cancer relates directly to a renewed focus on the vo¬ 
luminous next generation sequencing (NGS) data, and machine learning for the inference of 
explanatory models of how the (epi)genomic events are choreographed in cancer initiation and 
development. However, despite the increasing availability of multiple additional -omics data, 
this quest has been frustrated by various theoretical and technical hurdles, mostly stemming 
from the dramatic heterogeneity of the disease. In this paper, we build on our recent works 
on “selective advantage” relation among driver mutations in cancer progression and investi¬ 
gate its applicability to the modeling problem at the population level. Here, we introduce 
PiCnIc (Pipeline for Cancer Inference), a versatile, modular and customizable pipeline to ex¬ 
tract ensemble-level progression models from cross-sectional sequenced cancer genomes. The 
pipeline has many translational implications as it combines state-of-the-art techniques for sam¬ 
ple stratification, driver selection, identification of fitness-equivalent exclusive alterations and 
progression model inference. We demonstrate PiCnIc’s ability to reproduce much of the current 
knowledge on colorectal cancer progression, as well as to suggest novel experimentally verifiable 
hypotheses. 

Keywords: Cancer evolution; Selective advantage; Bayesian Structural Inference 

Statement of significance: A causality based new machine learning Pipeline for Cancer Infer¬ 
ence (^PicNic; is introduced to infer the underlying somatic evolution of ensembles of tumors from 
next generation sequencing data. PicNic combines techniques for sample stratification, driver selec¬ 
tion and identification of fitness-equivalent exclusive alterations to exploit a novel algorithm based 
on Suppes’ probabilistic causation. The accuracy and translational significance of the results are 
studied in details, with an application to colorectal cancer. PicNic pipeline has been made publicly 
accessible for reproducibility, interoperability and for future enhancements. 

1 Introduction 

Since the late seventies evolutionary dynamics, with its interplay between variation and selection, 
has progressively provided the widely-accepted paradigm for the interpretation of cancer emergence 
and development [T]-[^. Random alterations of an organism’s (epi)genome can sometimes confer 
a functional selective advantag^ to certain cells, in terms of adaptability and ability to survive 
and proliferate. Since the consequent clonal expansions are naturally constrained by the avail¬ 
ability of resources (metabolites, oxygen, etc.), further mutations in the emerging heterogeneous 
tumor populations are necessary to provide additional fitness of different kinds that allow survival 
and proliferation in the unstable micro environment. Such further advantageous mutations will 
eventually allow some of their sub-clones to outgrow the competing cells, thus enhancing tumor’s 
heterogeneity as well as its ability to overcome future limitations imposed by the rapidly exhaust¬ 
ing resources. Competition, predation, parasitism and cooperation have been in fact theorized as 
co-present among cancer clones [^. 

In the well-known vision of Hanahan and Weinberg [^[^, the phenotypic stages that charac¬ 
terize this multistep evolutionary process are called hallmarks. These can be acquired by cancer 
cells in many possible alternative ways, as a result of a complex biological interplay at several 
spatio-temporal scales that is still only partially deciphered [^. In this framework, we distinguish 

^For this and other technical terms commonly used in the statistics and cancer biology communities we provide 
a Glossary in the Supplementary Material. 
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“alterations” driving the hallmark acquisition process (i.e., drivers) by activating oncogenes or in¬ 
activating tumor suppressor genes, from those that are transferred to sub-clones without increasing 
their fitness (i.e., passengers) [^. Driver identification is a modern challenge of cancer biology, as 
distinct cancer types exhibit very different combinations of drivers, some cancers display mutations 
in hundreds of genes [^, and the majority of drivers is mutated at low frequencies (“long tail” 
distribution), hindering their detection only from the statistics of the recurrence at the population- 
level 

Cancer clones harbour distinct types of alterations. The somatic (or genetic) ones involve 
either few nucleotides or larger chromosomal regions. They are usually catalogued as mutations 
- i.e., single nucleotide or structural variants at multiple scales (insertions, deletions, inversions, 
translocations) - of which only some are detectable as Copy Number Alterations (CNAs), most 
prevalent in many tumor types [^. Also epigenetic alterations, such as DNA methylation and 
chromatin reorganization, play a key role in the process [^. The overall picture is confounded 
by factors such as genetic instability [^, tumor-microenvironment interpla y and by the 

influence of spatial organization and tissue specificity on tumor development | 16 P] 

Significantly, in many cases, distinct driver alterations can damage in a similar way the same 
functional pathway, leading to the acquisition of new hallmarks pT}]^ . Such alterations individu¬ 
ally provide an equivalent fitness gain to cancer cells, as any additional alteration hitting the same 
pathway would provide no further selective advantage. This dynamic results in groups of driver 
alterations that form mutually exclusive patterns across tumor samples from different patients (i.e., 
the sets of alterations that are involved in the same pathways tend not to occur mutated together). 
This phenomenon has significant translational consequences. 

An immediate challenge posed by this state of affairs is the dramatic heterogeneity of cancer, both 
at the inter-tumor and at the intra-tumor levels [^. The former manifests as different patients with 
the same cancer type can display few common alterations. This obsersvation led to the development 
of techniques to stratify tumors into subtypes with different genomic signatures, prognoses and 
response to therapy [^. The latter form of heterogeneity refers to the observed genotypic and 
phenotypic variability among the cancer cells within a single neoplastic lesion, characterized by the 
coexistence of more than one cancer clones with distinct evolutionary histories [^ . 

Cancer heterogeneity poses a serious problem from the diagnostic and therapeutic perspective 
as, for instance, it is now acknowledged that a single biopsy might not be representative of other 
parts of the tumor, hindering the problem of devising effective treatment strategies [^. Therefore, 
presently the quest for an extensive etiology of cancer heterogeneity and for the identification of 
cancer evolutionary trajectories is central to cancer research, which attempts to exploit the massive 
amount of sequencing data available through public projects such as The Cancer Genome Atlas 
(TCGA) (2^. 

Such projects involve an increasing number of cross-sectional (epi)genomic profiles collected via 
single biopsies of patients with various cancer types, which might be used to extract trends of cancer 
evolution across a population of sample^ Higher resolution data such as multiple samples collected 
from the same tumor [^, as well as single-cell sequencing data [^, might be complementarily 
used to face the same problem within a specific patient. However, the lack of public data coupled 

^We mention that much attention has been recently casted on newly discovered cancer genes affecting global 
processes that are apparently not directly related to cancer development, such as cell signa ling , chromatin and 
epigenomic regulation, RNA splicing, protein homeostasis, metabolism and lineage maturation llOl. 

^At the time of this writing, in TCGA, sample sizes per cancer type are in the order of a few hundreds. Such 
numbers are expected to increase in the near future, with a clear benefit for all the statistical approaches to analyze 
cancer data which currently lack a proper background of data. 
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to the problems of accuracy and reliability, currently prevents a straightforward application [^ . 

These different perspectives lead to the different mathematical formulations of the problem of 
inferring a cancer progression model from genomic data, and a need for versatile computational tools 
to analyze data reproducibly - two intertwined issues examined at length in this paper [^. Indeed, 
such models and tools can be focused either on characteristics of a population, i.e. ensemble-level, 
or on multiple clonality in a single-patient In general, both problems deal with understanding the 
temporal ordering of somatic alterations accumulating during cancer evolution, but use orthogonal 
perspectives and different input data - see Figure for a comparison. This paper proposes a 
new computational approach to efficiently deal with various aspects of the problem at a patient 
population level, relegating the other aspects to future publications. 

Ensemble-level cancer evolution. It is thus desirable to extract a probabilistic graphical model 
explaining the statistical trend of accumulation of somatic alterations in a population of n cross- 
sectional samples collected from patients diagnosed with a specific cancer. To normalize against 
the experimental conditions in which tumors are sampled, we only consider the list of alterations 
detected per sample - thus, as 0/1 Bernoulli random variables. 

Much of the difficulty lies in estimating the true and unknown trends of selective advantage 
among genomic alterations in the data, from such observations. This hurdle is not unsurmountable, 
if we constrain the scope to only those alterations that are persistent across tumor evolution in all 
sub-clonal populations, since it yields a consistent model of a temporal ordering of mutations. 
Therefore, epigenetic and trascriptomic states, such as hyper and hypo-methylations or over and 
under expression, could only be used, provided that they are persistent through tumor development 



Historically, the linear model of colorectal tumor progression by Vogelstein is an instance of 
an early solution to the cancer progression proble m [3Q] . That approach was later generalized to 
accommodate tree-models of branched evolution (m|| 34| and later, further generalized to the infer¬ 
ence of directed acyclic graph models, with several distinct strategies [35||^ . We contributed to 
this research program with the Cancer Progression Extraction with Single Edges (CAPRESE) and the 
Cancer Progression Inference (CAPRI) algorithms, which are currently implemented in TRONCO, an 
open source R package for Translational Oncology available in standard repositories [39}|4l] . Both 
techniques rely on Suppes’ theory of probabilistic causation to define estimators of selective advan¬ 
tage [^, are robust to the presence of noise in the data and perform well even with limited sample 
sizes. The former algorithm exploits shrinkage-like statistics to extract a tree model of progression, 
the latter combines bootstrap and maximum likelihood estimation with regularization to extract 
general directed acyclic graphs that capture branched, independent and confluent evolution. Both 
algorithms represent the current state-of-the-art approach to this problem, as they outperform 
others in speed, scale and predictive accuracy. 

Clonal architecture in individual patients. A closely related problem addresses the detection 
of clonal signatures and their prevalence in individual tumors, a problem complicated by intra-tumor 
heterogeneity. 

Even though this phylogenetic version of the progression inference problem naturally relies on 
data produced from single-cell sequencing assays [4^[44] , the majority of approaches still make use 
of bulk sequencing data, usually from multiple biopsies of the same tumors [^[45] . Indeed, several 
approaches try to extract the clonal signature of single tumors from allelic imbalance proportions. 
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a problem made difficult as sequenced samples usually contain a large number of cells belonging to 
a collection of sub-clones resulting from the complex evolutionary history of the tumor [46 - 

We keep the current work focused on the inference of progression models at the ensemble level, 
and plan to return to this variant to the problem in another publication. 

2 The PicNic pipeline 

We report on the design, development and evaluation of the Pipeline for Cancer Inference (PicNic) 
to extract ensemble-level cancer progression models from cross-sectional data (Figure [^. PicNic 
is versatile, modular and customizable; it exploits state-of-the-art data processing and machine 
learning tools to: 

1. identify tumor subtypes and then in each subtype; 

2. select (epi)genomic events relevant to the progression; 

3. identify groups of events that are likely to be observed as mutually exelusive] 

4. infer progression models from groups and related data, and annotate them with associated 
statistical confidence. 

All these steps are necessary to minimize the confounding effects of inter-tumor heterogeneity, which 
are likely to lead to wrong results when data is not appropriately pre-processecQ 

In each stage of PicNic different techniques can be employed, alternatively or jointly, according 
to specific research goals, input data, and cancer type. Prior knowledge can be easily accommo¬ 
dated into our pipeline, as well as the computational tools discussed in the next subsections and 
summarized in Figure The rationale is similar in spirit to workflows implemented by consortia 
such as TCGA to analyze huge populations of cancer samples [^[^. One of the main novelties 
of our approach, is the exploitation of groups of exclusive alterations as a proxy to detect fitness- 
equivalent trajectories of cancer progression. This strategy is only feasible by the hypothesis-testing 
features of the recently developed CAPRI algorithm, an algorithm uniquely addressing this crucial 
aspect of the ensemble-level progression inference problem [4Q] . 

In the Results section, we study in details a specific use-case for the pipeline, processing colorectal 
cancer data from TCGA, where it is able to re-discover much of the existing body of knowledge 
about colorectal cancer progression. Based on the output of this pipeline, we also propose novel 
experimentally-verifiable hypotheses. 

2.1 Reducing inter-tumor heterogeneity by cohort subtyping 

In general, for each of n tumors (patients) we assume relevant (epi)genetic data to be available. We 
do not put constraints on data gathering and selection, leaving the user to decide the appropriate 
“resolution” of the input data. For instance, one might decide whether somatic mutations should 
be classified by type or by location, or aggregated. Or, one might decide to lift focal CNAs to 

^The genuine selectivity relationship sought to be inferred are subject to the vagaries of Simpson’s paradox; it 
can change, or worst reverse, when we try to infer them from data not suitably pre-processed. This e ffect (due to 
such paradox) manifests as data are sampled from a highly heterogenous mixture of populations of cells [^. PiCnIc 
uses various mechanisms to avoid these pitfalls. In this context, it should be pointed out that input bulk sequencing 
data suffers also from intra-tumor heterogeneity issues, which are unfortunately intrinsic to the technology. 
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the lower resolution of cytobands or full arms (e.g., in a kidney cancer cohort where very long 
CNAs are more common than focal events [^). These choices depend on data and on the overall 
understanding of such alterations and their functional effects for the cancer under study, and no 
single all-encompassing rationale may be provided. 

With these data at hand, we might wish to identify cancer subtypes in the heterogeneous mixture 
of input samples. In some cases the classification can benefit from clinical biomarkers, such as 
evidences of certain cell types j^, but in most cases we will have to rely on multiple clustering 
techniques at once, see, e.g., (56|[^ . Many common approaches cluster expression profiles [^, often 
relying on non-negative matrix factorization techniques or earlier approaches such as /c-means, 
Gaussians mixtures or hierarchical/spectral clustering - see the review in [^. For glioblastoma 
and breast cancer, for instance, mRNA expression subtypes provides good correlation with clinical 
phenotypes [^-65 . However, this is not always the case as, e.g., in colorectal cancer such clusters 
mismatch with survival and chemotherapy response [^. Clustering of full exome mutation profiles 
or smaller panels of genes might be an alternative as it was shown for ovarian, uterine and lung 
cancers [^[^ . 

Using pipelines such as PicNic, we expect that the resulting subtypes will be routinely in¬ 
vestigated, eventually leading to distinct progression models, which shall be characteristic of the 
population-level trends of cancer initiation and progression. 


2.2 Selection of driver events 

In subtypes detection, it becomes easier to find similarities across input samples when more al¬ 
terations are available, as features selection gains precision. In progression inference, instead, one 
wishes to focus on m ^ n driver alterations, which ensure also an appropriate statistical ratio 
between sample size (n, here the subtype size) and problem dimension (m). 

Multiple tools filter out driver from passenger mutations. MutSigCV identifies drivers mutated 
more frequently than background mutation rate [^. 0 ncod rive FM avoids such estimation but 

looks for functional mutations [^. 0ncodriveCLUST scans mutations clustering in small regions 
of the protein sequence [^. MuSiC uses multiple types of clinical data to establish correlations 
among mutation sites, genes and pathways [^. Some other tools search for driver CNAs that affect 
protein expression [^. All these approaches use different statistical measures to estimate signs of 
positive selection, and we suggest using them in an orchestrated way, as done by platforms such as 
Intogen [T^ . 

We anticipate that such tools will run independently on each subtype, as driver genes will likely 
differ across them, mimicking the different molecular properties of each group of samples; also, lists 
of genes produced by these tools might be augmented with prior knowledge about tumor suppressors 
or oncogenes. 


2.3 Fitness equivalence of exclusive alterations 

When working at the ensemble-level, identification of “groups of mutually exclusive” alterations is 
crucial to derive a correct inference. This step of PicNic is another attempt to resolve part of the 
inter-tumor heterogeneity, as such alterations cot^/dlead to the same phenotype (i.e., hence resulting 
“equivalent” in terms of progression), despite being genotypically “alternative”, i.e., exclusive, across 
the input cohort. This information shall be used to detect alternative routes to cancer progression 
which capture the specificities of individual patients. 
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A plethora of recent tools can be used to detect groups of fitness equivalent alterations, according 
to the data available for each subtype; greedy approaches 
MEMO, which constrain search-space with network priors 



or their optimizations, such as 


This strategy is further improved 
in MUTEX, which scans mutations and focal CNAs for genes with a common downstream effect 
in a curated signalling network, and selects only those genes that significantly contributes to the 
exclusivity pattern [^. Other tools such as Dendrix, MDPFinder, Multi-Dendrix, CoMEt, MEGSA 
or ME, employ advanced statistics or generative approaches without priors [78}|^ . 

In such groups, we distinguish between hard and soft forms of exclusivity, the former assuming 
strict exclusivity among alterations, with random errors accounting for possible overlaps (i.e., the 
majority of samples do not share alterations from such groups), the latter admitting co-occurrences 
(i.e., some samples might have common alterations, within a group) [TT] . 

CAPRI is currently the only algorithm which incorporates this type of information, in inferring 
a model. Each of these groups are in fact associated with a ^%estahle hypothesis^^ wiitteD. in the well- 
known language of propositional Boolean formula^ Consider the following example: we might be 
informed that APC and ctnnbI mutations show a trend of soft-exclusivity in our cohort - i.e., some 
samples harbor both mutations, but the majority just one of the two mutated genes. Since such 
mutations lead to /3-catenin deregulation (the phenotype), we might wonder whether such state of 
affairs could be responsible for progression initiation in the tumors under study. An affermative 
response would equate, in terms of progression, the two mutations. To test this hypothesis, one may 
spell out formula APC V CTNNbI to CAPRI, which means that we are suggesting to the inference 
engine that, besides the possible evolutionary trajectories that might be inferred by looking at the 
two mutations as independent, trajectories involving such a “composite” event, shall be considered 
as well. It is then up to CAPRI to decide which, of all such trajectories, is significant, in a statistieal 
sense. 

In general, formulas allow users to test general hypotheses about complex model structures 
involving multiple genes and alterations. These are useful in many cases: for instance, where 
we are processing samples which harbour homozygous losses or inactivating mutations in certain 
genes (i.e., equally disruptive genomic events), or when we know in advance that certain genes are 
controlling the same pathway, and we might speculate that a single hit in one of those decreases the 
selection pressure on the others. We note that, with no hypothesis, a model with such alternative 
trajectories cannot he analyzed, due to various computational limitations inherent to the inferential 
algorithms (see [4Q]). 

From a practical point of view, CAPRI’s formulas/hypotheses-testing features “help” the inference 
process, but do not “force” it to select a specific model, i.e., the inference is not biased. In this 
sense, the trajectories inferred by examining these composite model structures (i.e., the formulas) 
are not given any statistical advantage for inclusion in the final model. However, in spite of a natural 
temptation to generate as many hypotheses as possible, it is prudent to always limit the number 
of hypotheses according to the number of samples and alterations. Note that this approach can 
also be extended to accommodate, for instance, co-occurrent alterations in significantly mutated 
subnetworks [M 85 . 


^There, logical connectives such as 0 (the logical “xor”) act as a proxy for hard-exclusivity, and V (the logical 
“disjunction”) for soft one. Besides from exclusivity groups, other connectives such as logical conjunction can be 
used. 
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2.4 Progression inference and confidence estimation 

We use CAPRI to reconstruct cancer progression models of each identified molecular subtype, pro¬ 
vided that there exist a reasonable list of driver events and the groups of fitness-equivalent exclusive 
alterations. Since currently CAPRI represents the state of the art, and supports complex formulas 
for groups of alterations detected in the earlier PicNic step, it was well-suited for the task. 

CAPRTs input is a binary n x (m + /c) matrix M with n samples (a subtype size), m driver 
alteration events (0/1 Bernoulli random variables) and k testable formulas. Each sample in M is 
described by a binary sequence: the I’s denote the presence of alterations. CAPRI first performs a 
computationally fast scan of M to identify a set S of plausible selective advantage relations among 
the driver alterations and the formulas; then, it reduces S to the most relevant ones, S C S Each 
relation is represented as an edge connecting drivers/formulas in a Graphical Model - which shall be 
termed Suppes-Bayes Causal Network. This network represents the joint probability distributioi^of 
observing a set of driver alterations in a cancer genome, subject to constraints imposed by Suppes’ 
probabilistic causation formalism [4^ . 

Set S is built by a statistical procedure. Among any pair of input drivers/formulas x and 
CAPRI postulates that x ^ y G S could be a selective advantage relation with “x selecting for if 
it estimates that two conditions hold 

1. “x is earlier than 

2. “x’s presence increases the probability of observing y^\ 

Such claims, grounded in Suppes’ theory of probabilistic causation, are expressed as inequalities 
over marginal and conditional distributions of x and y. These are assessed via a standard Mann- 
Withney U test after the distributions are estimated from a reasonable number (e.g., 100) of non- 
parametric bootstrap resamples of M (see Supplementary Material). CAPRI’s increased performance 
over existing methods can be motivated by the reduction of the state space within which models 
are searched, via S. 

Optimization of S is central to our tolerance to false positives and negatives in S. We would like 
to select only the minimum number of relations which are true and statistically supported, and build 
our model from those. CAPRI’s implementation in TRONCO selects a subset by optimizing a 
seore funetion which assigns to a model a real number equal to its log-likelihood (probability of 
generating data for the model) minus a penalty term for model complexity - a regularization term 
increasing with 5’s size, and hence penalizing overly complex models. It is a standard approach to 
avoid overfitting, and usually relies on the Akaike or the Bayesian Infornnation Criterion (AlC or BIC) 
as regularizers. Both scores are approximately correct; AlC is more prone to overfitting but likely 
to provide also good predictions from data and is better when false negatives are more misleading 
than positive ones. BIC is more prone to underfitting errors, thus more parsimonious and better 

^Technically, for a set of m alterations modeled by variables xi, ... ,Xm, such a network is a Graphical Model 
representing the factorization of the joint distribution - T(xi,... ,Xm) “ of observing any of the alterations in a 
genome (i.e., x^ = 1). This factorization is made compact as the model encodes the statistical dependencies in its 
structure via 

m 

'P(xi, . . . ,Xm) = Y[ I 

i=l 

where tt^ = {xj | Xj —)■ x^ G *S} are the “parents” of the Tth node. These are those from which the presence of the 
Tth alteration is predicted. In our approach these edges are the pictorial representation of the selective advantage 
relations where the alterations in tt^ select for x^. 
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in opposite direction. As often done, we suggest approaches that to combine but distinguish which 
relations are selected by BIC versus A 1C. Details on the algorithm are provided as see Supplementary 
Material. 

Statistical confidence of a model. In-vitro and in-vivo experiments provide the most convinc¬ 
ing validation for the newly suggested selective advantage relations and hypotheses, yet this is out 
of reach in some cases. 

Nonetheless, statistical validation approaches can be used almost universally to assess the con¬ 
fidence of edges, parent sets and whole models, either via hypothesis-testing or bootstrap and cross- 
validation scores for Graphical Models. We briefly discuss approaches that are implemented in 
T RON CO, and refer to the Supplementary Materials for additional details. 

First, CAPRI builds S by computing two p-values per edge, for the confidence in condition (1) 
and (2). In addition, for each edge x ^ y, it computes a third p-value via hypergeometric testing 
against the hypothesis that the co-occurrence of x and y is due to chance. These p-values measure 
confidence in the direction of each edge and the amount of statistical dependence among x and y. 

Second, for each model inferred with CAPRI we can estimate (a posteriori) how frequently our 
edges would be retrieved if we resample from our data {non-parametric bootstrap), or from the 
model itself, assuming its correctness {parametric bootstrap) [^. Also, we can measure the bias 
in CAPRI’s construction of S due to the random procedure which estimates the distributions in 
condition (1) and (2) {statistical hootstidip). 

Third, scores can be computed to quantify the consistency for the model against bias in the 
data and models. For instance, non-exhaustive k-fo\d cross-validation can be used to compute the 
entropy loss for the whole model, and the prediction and posterior classification errors for each edge 
or parent set ^7\ . 

3 Results 

3.1 Evolution in a population of MSI/MSS colorectal tumors. 

It is common knowledge that colorectal cancer (CRC) is a heterogeneous disease comprising different 
molecular entities. Indeed, it is currently accepted that colon tumors can be classified according 
to their global genomic status into two main types: micro satellite unstable tumors (MSI), fur¬ 
ther classified as high or low, and micro satellite stable (MSS) tumors (also known as tumors with 
chromosomal instability). This taxonomy plays a significant role in determining pathologic, clinical 
and biological characteristics of CRC tumors [^. Regarding molecular progression, it is also well 
established that each subtype arises from a distinctive molecular mechanism. While MSS tumors 
generally follow the classical adenoma-to-carcinoma progression described in the seminal work by 
Vogelstein and Fearon [^, MSI tumors result from the inactivation of DNA mismatch repair genes 
like MLH-1 [^ . 

With the aid of the TRONCO package, we instantiated PicNic to process colorectal tumors 
freely available through TCGA project COADREAD (see Supplementary Figure SI), and in¬ 
ferred models for the MSS and MSI-HIGH tumor subtypes (shortly denoted MSI) annotated by the 
consortium. In doing so, we used a combination of background knowledge produced by TCGA and 
new computational predictions; to a different degree, some knowledge comes from manual curat ion 
of data and other from tools mentioned in PicNic’s description (see Figure]^. Data and exclusiv- 
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ity groups for MSI tuuiors are shown in Figure the analogous for MSS tumors is provided as 
Supplementary Material. 

For the models inferred, which are shown in Figures and we evaluated various forms of 
statistical confidence measured as p-values, bootstrap scores (in what follows, npb denotes non- 
parametric bootstrap and the closer to 100 the better), and cross-validation statistics reported 
in the Supplementary Material. Many of the postulated selective advantage relations (i.e., model 
edges) have very strong statistical support for COADREAD samples, although events with similar 
marginal frequency may lead to ambiguous imputed temporal ordering (i.e., the edge direction). In 
general, we observed that overall the estimates are slightly better in the MSS cohort (entropy loss 
< 1% versus 3.8%), which is expected given the difference in sample size of the two datasets (152 
versus 27 samples), see Material and Methods for details. 

Interpretation of the models. Our models capture the well-known features distinguishing MSS 
and MSI tumors: for the former APC, KRAS and tp53 mutations as primary events together with 
chromosomal aberrations, for the latter braf mutations and lack of chromosomal alterations. Of all 
33 driver genes, 15 are common to both models - e.g., APC, braf, KRAS, NRAS, tp53 and fam123b 
among others (mapped to pathways like wnt, mark, apoptosis or activation of T-cell lymphocites), 
although in different relationships (position in the model), whereas new (previously un-implicated) 
genes stood out from our analysis and deserve further research. 

MSS (Microsatellite Stable). In agreement with the known literature, in addition to KRAS, tp 53 
and APC as primary events, we identify PTEN as a late event in the carcinogenesis, as well as 
NRAS and KRAS converging in igf 2 amplification, the former being “selected by” TP53 muta¬ 
tions (npb 49%), the latter “selecting for” pik3ca mutations (npb 81%). The leftmost portion 
of the model links many WNT genes, in agreement with the observation that multiple con¬ 
current lesions affecting such pathway confer selective advantage. In this respect, our model 
predicts multiple routes for the selection of alterations in SOx9 gene, a transcription factor 
known to be active in colon mucosa [^. Its mutations are directly selected by apc/ctnnbI 
alterations (though with low npb score), by arid 1 A (npb 34%) or by fbxw 7 mutations (npb 
49%), an early mutated gene that both directly, and in a redundant way via CTNNBl, relates 
to SOx9. The SOX family of transcription factors have emerged as modulators of canoni¬ 
cal WNT//3-catenin signaling in many disease contexts [^. Also interestingly, fbxw 7 has 
been previously reported to be involved in the malignant transformation from adenoma to 
carcinoma [^. The rightmost part of the model involves genes from various pathways, and 
outlines the relation between KRAS and the pi3k pathway. We indeed find selection of pik3ca 
mutations by KRAS ones, as well as selection of the whole MEMO module (npb 64%), which is 
responsible for the activation of the pi3k pathway . smad 4 proteins relate either to KRAS 
(npb 34%), and fam123b (through atm) and tcf7l 2 converge in dkk 2 or dkk4 (npb 81, 17 
and 34%). 

MSI-HIGH (Microsatellite Unstable). In agreement with the current literature, braf is the most 
commonly mutated gene in MSI tumors [^. CAPRI predicted convergent evolution of tu¬ 
mors harbouring FBXW7 or APC mutations towards deletions/mutations of NRAS gene (npb 
21, 28 and 54%), as well as selection of smad2 or smad4 mutations by fam123b mutations 
(npb 23 and 46%), for these tumors. Relevant to all MSI tumors seems again the role of 
the pi3k pathway. Indeed, a relation among APC and pik3ca mutations was inferred (npb 
66%), consistent with recent experimental evidences pointing at a synergistic role of these 
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mutations, which co-occurr in the majority of human colorectal cancers [^. Similarly, we 
find consistently a selection trend among APC and the whole MEMO module (npb 48%). In¬ 
terestingly, both mutations in APC and erbb3 select for KRAS mutations (npb 51 and 27%), 
which might point to interesting therapeutic implications. In contrast, mutations in braf 
mostly select for mutations in ACVRlB (npb 36%), a receptor that once activated phospho- 
rylates SMAD proteins. It forms receptor complex with acvr2a, a gene mutated in these 
tumors that selects for tcf7l2 mutations (npb 34%). Tumors harbouring tp53 mutations 
are those selected by mutations in axin2 (npb 32%), a gene implicated in wnt signalling 
pathway, and related to unstable gastric cancer development [^. Inactivating mutations in 
this gene are important, as it provides serrated adenomas with a mutator phenotype in the 
MSI tumorigenic pathway j^. Thus, our results reinforce its putative role as driver gene in 
these tumors. 

By comparing these models we can find similarity in the prediction of a potential new early event 
for CRC formation, fbxw7, as other authors have recently described [^. This tumor suppressor 
is frequently inactivated in human cancers, yet the molecular mechanism by which it exerts its 
anti-tumor activity remains unexplained [^, and our models provide a new hypothesis in this 
respect. 

4 Discussion 

This paper represents our continued exploration of the nature of somatic evolution in cancer, and 
its translational exploitation through models of cancer progression, models of drug resistance (and 
efficacy), left- and right-censoring, sample stratification, and therapy design. Thus this paper em¬ 
phasizes the engineering and dissemination of production-quality computational tools as well as 
validation of its applicability via use-cases carried out in collaboration with translational collabo¬ 
rators: e.g., colorectal cancer, analyzed jointly with epidemiologists currently studying the disease 
actively. As anticipated, we reasserted that the proposed model of somatic evolution in cancer not 
only supports the heterogeneity seen in tumor population, but also suggests a selectivity/causality 
relation that can be used in analyzing (epi)genomic data and exploited in therapy design - which we 
introduced in our earlier works [^[4Q] . In this paper, we have introduced an open-source pipeline, 
PicNic, which minimizes the confounding effects arising from inter-tumor heterogeneity, and we have 
shown that PicNic can be effective in extracting ensemble-level evolutionary trajectories of cancer 
progression. 

When applied to a highly-heterogeneous cancer such as colorectal, PicNic was able to infer 
the role of many known events in colorectal cancer progression (e.g., APC, KRAS or tp53 in MSS 
tumors, and BRAF in MSI ones), confirming the validity of our approaclQ Interestingly, new players 
in CRC progression stand out from this analysis such as fbxw7 or axin2, which deserve further 
investigation. In colon carcinogenesis, although each model identifies characteristic early mutations 
suggesting different initiation events, both models appear to converge in common pathways and 
functions such as wnt or mapk. 

^As a further investigation for CRC, we leave as future work to check whether the inferred progression are also 
representative of other subtyping strategies for colorectal cancer, with particular reference to recent works which 
show marked interc onn ectivity between different independent classification systems coalescing into four consensus 
molecular subtypes [99] . 
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However, both models have some clear distinctive features. Specific events in MSS include 
mutations in intracellular genes like ctnnbI or in pten, a well-known tumor suppressor gene. On 
the contrary, specific mutations in MSI tumors appear in membrane receptors such as ACVRlB, 
acvr2a, erbb3, lrp5, tgfbrI and tgfbr2, as well as in secreted proteins like igf2, possibly 
suggesting that such tumors need to disturb cell-cell and/or cell-microenvironment communication 
to grow. At the pathway level, genes exclusively appearing in the MSI progression model accumulate 
in specific pathways such as cytokine-cytokine receptor, endocytosis and TGF-/3 signaling pathway. 
On the other hand, genes in MSS progression model are implicated in p53, mTOR, sodium transport 
or inositol phosphate metabolism. 

Our study also highlighted the translational relevance of the models that we can produce with 
PicNic (see Supplementary Figure S12). The evolutionary trajectories depicted by our models can, 
for instance, suggest previously-uncharacterized phenotypes, help in finding biomarker molecules 
predicting cancer progression and therapy response, explain drug resistant phenotypes and predict 
metastatic outcomes. The logical structure of the formulas describing alterations with equivalent 
fitness (i.e., the exclusivity group) can also point to novel targets of therapeutic interventions. 
In fact, exclusivity groups that are found to have a role in the progression can be screened for 
synthetic lethality among such genes - thus explaining why we do not observe phenotypes where 
such alterations co-occur. In this sense, our models describe also such clonal signatures which, 
though theoretically possible, are not selected. We call such conspicuously absent phenotypes anti¬ 
hallmarks |100| . 

Our models have other applications to both computational and cancer research. Our models, 
as encoded by Suppes-Bayes Causal Networks could be used as informative generative models for 
the genomic profiles for the cancer patients. In fact, as known in machine learning, such generative 
models are extremely useful in creating better representation of data in terms of, e.g., discriminative 
kernels, such as Fisher |1Q1| . In practice, this change of representations would allow framing common 
classification problems in the domain of our generative structures, i.e., the models, rather than the 
data. As a consequence, it is possible to create a new class of more robust classification and 
prediction systems. 

One may think of these representations as those bringing us closer to phenotypic (and causal) 
representation of the patient’s tumor, replacing its genotypic (and mutational) representation. We 
suspect that such representations will improve the accuracy of measurement of the biological clocks, 
dysregulated in cancer and critically needed to be measured in order to predict survival time, time 
to metastasis, time to evolution of drug resistance, etc. We believe that these “phenotypic clocks” 
can be used immediately to direct the therapeutic intervention. 

Clearly, applicability and reliability of techniques such as PicNic is very much dependent on 
the background of data available. At the time of this writing, the quality, quantity and reliability 
of (epi)genomic data available, e.g., in public databases, is related to the ever increasing com¬ 
putational and technological improvements characterizing the wide area of cancer genomics. Of 
similar importance is the availability of wet-lab technologies for models validation. Our recent work 
on SubOptical Mapping technology, for instance, points to the ability to cheaply and accurately 
characterize translocation, indels and epigenomic modifications at the single molecule and single 
cell level |1Q2 1Q3| . This technology also provides the ability to directly validate (or refute) the 
hypotheses generated by PicNic via gene-correction and single cell perturbation approaches. 

To conclude, the precision of any statistical inference technique, including PicNic, is influenced 
by the quality, availability and idiosyncrasies of the input data - the goodness of the outcomes 
improving along with the expected advancement in the field. Nevertheless, the strength of the 
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proposed approach lies in the efficacy in managing possibly noisy/ biased or insufficient data, and 
in proposing refutable hypotheses for experimental validation. 

5 Materials and methods 

Processing COADREAD samples with PiCnIc. We instantiated PicNic to process clinically 
annotated high MSI-HIGH and MSS colorectal tumors collected from The Cancer Genome Atlas 
project “Human Colon and Rectal Cancer” (COADREAD) - see Supplementary Figure SI. 
Details on the implementation and the source code to replicate this study are available as Supple¬ 
mentary Material. COADREAD has enough samples, especially for MSS tumors, to implement a 
consistent and significant statistical validation of our findings - see Supplementary Table SI. 

In brief, we split subtypes by the microsatellite status of each tumor as annotated by the con¬ 
sortium (so, step I of PicNic is done by exploiting background knowledge rather than computational 
predictors). It should be expected that if this step is skipped or this classification is incorrect, the 
resulting models would noticeably differ. Once split into groups, the input COADREAD data is 
processed to maintain only samples for which both high-quality curated mutation and CNA data 
are available; for CNAs we use focal high-level amplifications and homozygous deletions. 

Then, for each sample we select only alterations (mutations/CNAs) from a list of 33 driver genes 
manually annotated to 5 pathways in - wnt, raf, tgf-/3, pi3k and p53 (Supplementary Figures 
S2 and S3). This list of drivers, step II of PicNic, is produced by TCGA, as a result of manual 
curation and running MutSigCV. 

In the next module of the pipeline, we fetch groups of exclusive alterations. We scanned these 
groups by using the MUTEX tool (Supplementary Table S2), and merged its results with the 
group that TCGA detected by using the MEMO tool, which involves mainly genes from the pi3k 
pathway. Knowledge on the potential exclusivity among genes in the wnt (apc,ctnnb1) and raf 
(kras,nras,braf) pathways was exploited as well. Groups were then used to create CAPRTs 
formulas; we also included hypotheses for genes which harbour mutations and homozygous deletions 
across different samples, see Supplementary Table S3. Data and exclusivity groups for MSS tumors 
are shown in Supplementary Figure S4 and S5. 

CAPRI was run, as the last step of PicNic, on each subtype, by selecting recurrent alterations 
from the pool of 33 pathway genes and using both AIC/BIC regularizer. Timings to run the relevant 
steps of the pipeline are reported in the Supplementary Material. In the models of Figures]^ and 
Figure each edge mirrors selective advantage among the upstream and downstream nodes, as 
estimated by CAPRI; Mann-Withney U test is carried out with statistical significance 0.05, after 
100 non-parametric bootstrap iterations. 

The significance of the reconstructed models and the input data is assessed by computing all the 
statistics/tests discussed in the Main text (temporal priority, probability raising and hypergeometric 
testing p-values, bootstrap and cross-validation scores). Motivation and background on each of 
these measures is available in the Supplementary Materials. A table with their values for edges 
with highest non-parametric bootstrap scores is in Supplementary Figure S8. 

For the MSS cohort all the p-values are strongly significant (p<o.oi) except for the temporal 
priority of the edges connecting mutations in fam123b and atm, and erbb2 alterations (mutations 
and amplifications), which leads us to conclude that, even if these pairs of genes seem to undergo 
selective advantage, the temporal ordering of their occurrence is ambiguous and failed to be imputed 
correctly from the datasets, analyzed here. The same situation occurs in MSI-HIGH tumors, for the 
relation between kras and erbb3. Non-parametric and statistical bootstrap estimations are used 
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to assess the strength of all the findings (Supplementary Figures S6 and S7). Moreover, any bias 
in the data is finally evaluated by cross-validation (Supplementary Figures S8-S11) and common 
statistics such as entropy loss, posterior classification and prediction errors. In general, most of the 
selective advantage relations depicted by the inferred models present a strong statistical support, 
with the MSS cohort presenting the most reliable results. 

Summary implementation for COADREAD (PicNic steps. Figure [^: (1) TCGA clinical classi¬ 
fication, (2) MutSigCV and TCGA manual curation, (3) MEMO, MUTEX and knowledge of wnt 
and RAF pathways and (4) CAPRI. 

Implement your own case study with PiCnIc/TRONCO. TRONCO started as a project 
before PicNic, and is our effort at collecting, in a free R package, algorithms to infer progression 
models from genomic data. In its current version it offers the implementation of the CAPRI 
and CAPRESE algorithms, as well as a set of routines to pre-process genomic data. With the 
invention of PicNic, it started accommodating software routines to easily interface CAPRI and 
CAPRESE to some of the tools that we mention in Eigure In particular, in its current 2.0 
version it supports input/output for the Matlab Network Based Stratification tool (NBS) and 
the Java MUTEX tool, as well as the possibility to fetch data available from the cBioPortal for 
Cancer Genomics (http://cbioportal.orghttp://cbioportal.org), which provides a Web resource 
for exploring, visualizing, and analyzing multidimensional cancer genomics data. 

We plan to extend TRONCO in the future to support other similar tools and become an integral 
part of daily laboratory routines, thus facilitating application of PiCnIc to additional use cases. 
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Figure 1: A. Problem statement, (left) Inference of ensemble-level cancer progression models from 
a cohort of n independent patients (cross-sectional). By examining a list of somatic mutations or 
CNAs per patient (0/1 variables) we infer a probabilistic graphical model of the temporal ordering 
of fixation and accumulation of such alterations in the input cohort. Sample size and tumor het¬ 
erogeneity complicate the problem of extracting population-level trends, as this requires accounting 
for patients’ specificities such as multiple starting events, (right) For an individual tumor, its clonal 
phylogeny and prevalence is usually inferred from multiple biopsies or single-cell sequencing data. 
Phylogeny-tree reconstruction from an underlying statistical model of reads coverage or depths es¬ 
timates alterations’ prevalence in each clone, as well as ancestry relations. This problem is mostly 
worsened by the high intra-tumor heterogeneity and sequencing issues. B. The PiCnIc pipeline for 
ensemble-level inference includes several sequential steps to reduce tumor heterogeneity, before ap¬ 
plying the CAPRI algorithm. Available mutation, expression or methylation data are first used 
to stratify patients into distinct tumor molecular subtypes, usually by exploiting clustering tools. 
Then, subtype-specific alterations driving cancer initiation and progression are identified with sta¬ 
tistical tools and on the basis of prior knowledge. Next is the identification of the fitness-equivalent 
groups of mutually exclusive alterations across the input population, again done with computa¬ 
tional tools or biological priors. Finally, CAPRI processes a set of relevant alterations within such 
groups. Via bootstrap and hypothesis-testing, CAPRI extracts a set of “selective advantage rela¬ 
tions” among them, which is eventually narrowed down via maximum likelihood estimation with 
regularization (with various scores). The ensemble-level progression model is obtained by combining 
such relations in a graph, and its confidence is assessed via various bootstrap and cross-validation 
techniques. 


22 




























PiCnIc 

Pipeline for Cancer Inference 

MOTIVATION 

Mutations — 

Copy Number ^ 

Expression P 

Mehtylations ^ 

> 

Other » 

COMPUTATIONAL OPTIONS AND PRIOR KNOWLEDGE 3§ 

EXPECTED OUTPUT AND ACTION TO 

EXECUTE 

1 Cohort subtyping 

Determine molecular subtypes 
likely to progress through 
different trajectories 

/ / / 

Non-negative Matrix Factorization 
(NMF), k-Means, Gaussian Mixtures, 
Hierarchical/Spectral Clustering, 
Network Based Stratification (NBS) 

Biomarkers (cell types, known 
mutations, ....), Clinical Annotations 
(Mutation Status, Chromosomal 
Stability, ...) 
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likely to drive progression 
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A rank of genes and In each cluster, restrict to 

their alterations consider only driver events 
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* Data marked as X can be used when it is persistent (i.e., do not revert back to their original state) during tumor progression. Other: data not common to most tumor types such as fusions or partial tandem duplication. 

Not all tools support all the data that is theoretically usable for a certain step. 

★ CAPRI is the only algorithm to exploit knowledge provided by step 3 via logical formulas hypotheses-testing. Oncotrees, Distance-based, Mixtuers and CAPRESE are constrained to infer at most tree-models of progression. 


Figure 2: The PiCnIc pipeline. We do not provide a unique all-encompassing rationale to instantiate 
PiCnIc as all steps refer to research area currently development, where the optimal approach is 
often dependent on the type of data available and prior knowledge about the cancer under study. 
References are provided for each tool that can be used to instantiate PiCnIc: NMF [^, k-Means, 
Gaussian Mixtures, Hierarchical/Spectral Clustering 1^, NBS (^, MutSigCV [^, OncodriveFM 


[^ , OncodriveCLUS T [TO] , MuSi C [tT] Oncodrive-CIS Intogen [7^, Ratio |74|, R ME [75 


|82|, ME |83|, CAPRI j40|, CAPRESE |39|, Oncotrees |31 33 , Distance-based |32|, Mixtures |34| 
CBN |3^[W, Resic [3711^4 BML 
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Figure 3: A. MSI-HIGH colorectal tumors from the TCGA COADREAD project [^, restricted 
to 27 samples with both somatic mutations and high-resolution CNA data available and a selection 
out of 33 driver genes annotated to wnt, ras, pi3k, TGF-/3 and p53 pathways. This dataset 
is used to infer the model in Figure B. Mutations and CNAs in MSI-HIGH tumors mapped 
to pathways confirm heterogeneity even at the pathway-level. C. Groups of mutually exclusive 
alterations were obtained from - which run the MEMO tool - and by MUTEX tool. In 
addition, previous knowledge about exclusivity among genes in the RAS pathway was exploited. D. 
A Boolean formula input to CAPRI tests the hypothesis that alterations in the RAS genes KRAS, 
NRAS and BRAF confer equivalent selective advantage. The formula accounts for hard exclusivity 
of alterations in NRAS mutations and deletions, jointly with soft exclusivity with KRAS and NRAS 
alterations. 
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Progression of MSS tumors in TCGA COADREAD 



Figure 4: Selective advantage relations inferred by CAPRI constitute MSS progression; input 
dataset in Supplementary Figure S3 and S4. Formulas written on groups of exclusive alterations, 
e.g., SOx9 amplifications and mutations, are displayed in expanded form; their events are connected 
by dashed lines with colors representing the type of exclusivity (red for hard, orange for soft), logical 
connectives are squared when the formula is selected, and circular when the formula selects for a 
downstream node. For this model of MSS tumors in COADREAD, we find strong statistical support 
for many edges (p-values, bootstrap scores and cross-validation statistics shown as Supplementary 
Material), as well as the overall model. This model captures both current knowledge about CRC 
progression - e.g, selection of alterations in pi3k genes by the KRAS mutations (directed or via the 
MEMO group, with BIC) - as well as novel interesting testable hypotheses - e.g., selection of SOx9 
alterations by fbxw7 mutations (with BIC). 
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Figure 5 : A. Selective advantage relations inferred by CAPRI constitute MSI-HIGH progression; 
input dataset in Figure Formulas written on groups of exclusive alterations are expanded as 
in Figure For each relation, confidence is estimated as for MSS tumors and reported as Sup¬ 
plementary Material. In general, this model is supported by weaker statistics than MSS tumors - 
possibly because of this small sample size (n=27). Still, we can find interesting relations involving 
APC mutations which select for pikSca ones (via BIC) as well as selection of the MEMO group 
(erbb2/pik3ca mutations or igf2 deletions) predicted by AIC. Similarly, we find a strong selection 
trend among mutations in erbb2 and KRAS, despite in this case the temporal precedence among 
those mutations is not disentangled as the two events have the same marginal frequencies (26%). B. 
Evolutionary trajectories of clonal expansion predicted from two selective advantage relations in the 
model. APC-mutated clones shall enjoy expansion, up to acquisition of further selective advantage 
via mutations or homozygous deletions in NRAS. These cases should be representative of different 
individuals in the population, and the ensemble-level interpretation should be that “APC mutations 
select for NRAS alterations, in hard exclusivity” as no sample harbour both alterations. A similar 
argument can show that the clones of patients harbouring distinct alterations in acvrIb - and 
different upstream events - will enjoy further selective advantage from mutation in the tgfbr2 
gene. 
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A Reproducing this study 
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B Glossary 

This glossary of terms shall be of help to readers not familiar with the concepts mentioned in the 
Main Text. For clarity, terms are separate in two categories according to the fact that they are 
common to the statistics or the cancer biology communities. Each term which is included in this 
glossary appears in color. 


Term 

Terms common to the statistics community 

Meaning 

Boolean for¬ 
mula 

In CAPRI, a formula written with standard logical operators which capture a relation among 
a group of alterations. In PicNic, these are used to detect alternative routes to selective 
advantage from mutually exclusive alterations. See: Fitness equivalence. 

Ensemble 
level progres¬ 
sion inference 

Detection of the relations of selective advantage across the permanent alterations in 
a cohort of independent tumors (cross-sectional data). When aggregated in a graphi¬ 
cal model, these shall picture the most common evolutionary trajectories in the popula¬ 
tion/cancer under study. See also: inter-tumor heterogeneity. 

Graphical 

models 

In this context, a direct acyclic graph with nodes (alterations) and edges (selective ad¬ 
vantage relations), as a shorthand to represent the joint probability of observing a set of 
alterations in a sample (i.e., a cancer genotype). See also: Suppes-Bayes Causal Net¬ 
work, Model selection. 

Individual 
level progres¬ 
sion inference 

Detection of clonal signatures and their prevalence in individual tumors by scanning multi¬ 
region or single-cell sequencing data; clones are then displayed in a phylogenetic tree 
structure. See also: intra-tumor heterogeneity. 

Model selec¬ 
tion 

The process of selecting a model which fits data, according to some criterion. In CAPRI, this 
is done by balancing model likelihood (a measure of to which degree data can be explained by 
the model) and model complexity (the size of the graphical model). See: regularization. 

Phylogenetic 

tree 

In this context, rooted tree where each node is a clone, and edges represent ancestry relations 
among clones. 

Regularization 

Common approach to avoid overfitting (false-positives) during model selection - in CAPRI 
this is achieved by using the standard AIC/BIC which penalize with different severity graph¬ 
ical models which contain many selective advantage relations. 

Simposon’s 

paradox 

A paradox in statistics, in which a trend appears in different groups of data but disappears 
or reverses when these groups are combined. In this context, this shall refer to genuine 
selective advantage relations which are not inferred unless data coming from different 
populations is separated before doing inference. See: heterogeneity, subtypes, formulas 

Suppes-Bayes 
Causal Net¬ 
works 

A specific type of graphical model returned by CAPRI algorithm, where each edge satisfies 
Suppes’s conditions of probabilistic causation subsuming temporal ordering and positive 
statistical dependence - the statistical approach to estimate selective advantage among 
the alterations. 
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Term 

Terms common to the cancer biology community 

Meaning 

Alterations 

Somatic mutations: A change in the genome of a cell that is not inherited from a parent, but 
is acquired. CNVs: Structural variation of large regions of DNA segments, including deletions, 
insertions, duplications and complex multi-site variants. 

Bulk se- 

Genome sequencing from single tumor samples, each containing a large number of cells. The 

quencing 

resulting genomic profiles are derived from a mixture of cells with potentially distinct evolutionary 
histories. 

Clones; 

Clonal ex- 

Clone: group of cells sharing an identical genome and that derive from a common ancestor. Clonal 
expansion: the production of descendent cells all arising originally from a single cell. In the scenario 

pansion 

of cancer development, tumors develop through a series of clonal expansions, in which the most 
favorable clonal population survives and proliferate. 

Cross- 

sectional 

data 

Unique snapshots of data derived from samples that are collected at unknown time points. Usually 
derived from bulk sequencing technologies. 

Driver; Pas- 

Driver: (epi)genetic alteration that provides a selective advantage to a cancer clone. Passen- 

senger 

ger: alteration of a cancer cell that does not increase its fitness. 

Exclusivity 
of alter¬ 

ations 

Group of alterations which manifest few or no co-occurences in a cohort of different samples, and 
might be fitness-equivalent for tumor progression. Hard exclusivity: when co-occurrences shall be 
considered the result of random errors. Soft exclusivity: when few co-occurrences shall be possible. 
See: formulas. 

Fitness 

A cell’s ability of surviving, proliferating and adapting to environmental changes, usually within an 
environment with limited and depleting resources (e.g., oxygen or nutrients). 

Fitness 

equivalence 

Groups of driver alterations, functional to the same pathway or equally dysruptive, that can inde¬ 
pendently confer a selective advantage to a cancer cell. Multiple co-occurrence of such alterations 
to provide no further advantage, hence leading to mutually exclusive alteration patterns across 
distinct samples. 

Hallmark of 

Gommon traits or phenotypic properties that are supposed to drive the transformation of normal 

cancer 

cells to cancer cells. Anti-hallmark: clonal profiles that are usually not observed, yet being 
theoretically possible. 

Inter-tumor 

heterogene¬ 

ity 

The phenomenon according to which different patients with the same cancer type usually display 
a few common alterations. This is the major problem of inferring ensemble-level cancer pro¬ 
gression models. 

Intra-tumor 

heterogene¬ 

ity 

Intra-tumor heterogeneity is related to possible coexistence of different cancer clones, with different 
evolutionary histories and different mutational profiles, within the same tumor. This is the major 
problem of inferring individual-level cancer progression models. 

Multiregion 

sequencing 

Gollection of genomic data obtained by processing multiple spatially separated biopsy samples from 
the same individual tumor. 

Next Gen¬ 
eration 
Sequencing 
(NGS) 

New technologies for sequencing genomes at high speed and low cost, including, e.g., full- 
genome/exome sequencing, genome resequencing, transcriptome profiling (RNA-Seq), DNA-protein 
interactions (GhIP-Seq), and epigenome characterization. 

Selective ad¬ 
vantage rela¬ 
tion 

In successive waves of clonal expansions one or more cells of the same clone can (progressively) 
increase their fitness throu^ the acquisition of additional driver alterations, leading to the 
emergence and development of a fitter clone. In this case a relation of selective avantage connects 
the earlier to the succeeding alterations. 

Single-cell 

sequencing 

Recent technology based on the retrieval and analysis of genomic information from individual cells, 
rather than from mixtures of cells. 

Synthetic 

lethality 

The phenomenon according to which two otherwise non-lethal alterations lead the cell death when 
they co-occur within the same cell. See: Anti-hallmark 




















C PicNic’s implementation for COADREAD samples 

Here we detail all the steps implemented to use PicNic for CRC progression inference. 

C.l TCGA COADREAD project data 

COADREAD provides genome-scale analysis of samples with exome sequence, DNA copy number, 
promoter methylation, messenger RNA and microRNA expression data, which we used to define 
the input dataset. In particular, only samples with both mutations and CNAs profiles were used in 
the analysis. Supplementary Table |ST| details the dataset. 

Dataset used to infer models presented in the Main Text. Samples published in were 
used as, to the best of our knowledge, these represent the highest-quality data made available by 
COADREAD as of today; for these samples TCGA provides somatic mutation profiles and high- 
resolution focal CNAs via GISTIC. These are obtained from TCGA data freeze as of 2 Eebruary 
2012, downloaded on 12 March 2015, from repository: 

https://tcga-data.nci.nih.gov/docs/publications/coadread_2012/ 

The following files were processed to produce the data: 

• TCGA_CRC_Suppl_Table2_Mutations_201 2071 9.xlsx 

Somatic mutations profiles obtained via whole-exome sequencing of 224 colorectal tumors by 
TCGA. Data available consists of 15995 mutations in 228 samples, provided in the Man¬ 
ual Annotation Format (MAE). Samples were selected to univocally match the 224 pa¬ 
tients as of the TGGA guidelines for aliquote disambiguation, see https://wiki.nci.nih. 
gov/display/TCGA/TCGA+barcode, All the mutations annotated by TCGA - truncating 

(De_novo_Start_OutOfFrame, Frame_Shift_Del, Frame_Shift_ln, Nonsense_Mutation,Splice_Site, 
Frame_Shift_lns, ln_Frame_Del), silent (Silent) and missense (Missense_Mutation) - were 
considered for analysis; notice that the majority of them are missense, see Eigures [S7| and [S8| 

• crc_gistic.txt.zip 

Eocal Copy Number Alterations (GNAs) for 564 patients derived from whole-genome sequenc¬ 
ing using the Illumina HiSeq platform. High-level gains and homozygous deletions were con¬ 
sidered for analysis by selecting entries with GISTIG scores ± 2; 

• crc_clinical_sheet.txt 

Clinical data summary with patient stage and Micro Satellite Stabe/Unstable (MSS/MSI) 
status being any of: MSS, MSI-high and MSI-low. 

The list of patients used was first reduced to those having both CNAs and somatic mutation 
data, and then was split into two groups: MSI-HIGH and MSS. The training cohort has 152 MSS 
and 27 MSI-HIGH samples; samples flagged as low MSI were excluded from the study as they have 
not been shown to differ in their clinicopathologic features or in most molecular features from MSS 
tumors [?]. 
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C.2 Driver events selection 


In the TCGA COADREAD study integrated analysis of mutations, copy number and mRNA 
expression changes in 195 tumours with complete data was performed. Part of the analysis was 
carried out by using the MutSig tool [^, as well as manual curation. Samples were grouped by 
mutation status, and recurrent alterations in key CRC pathways were identified in (Fig. 4, 
Supplementary Fig. 6 and Supplementary Table 1) as a result, we can use the consortium’s list of 
33 driver genes annotated to 5 pathways and use these to extract our progression models. These 
are well-known cancer genes, frequently reported as relevant to colorectal progression and to the 
major pathways involved in CRC. Driver events are alterations in: 

• WNT genes (14): APC, DKK-4, tcf7l2, ctnnbI, lrp5, fbxw7, dkk-1, fzdIO, aridIa, 
dkk-2, fam123b, sox9, dkk-3 and axin2; 

• rtk/ras genes (5): erbb2, erbb3, nras, eras and braf; 

• tgf-13 genes (5): tgfbrI, smad3, tgfbr2, smad4, agvrIb, agvr2a and smad2; 

• igf2/pi3k genes (5): igf2, irs2, pik3ga, pik3r 1 and pten; 

• p53 genes (2): tp53 and ATM. 

In the Main Text, rtk/ras and igf2/pi3k pathways are shortly denoted as RAS and pi3k. 

The distinct types of mutations detected in these genes are shown in Figure [S7| as well as the 
overall rate of COADREAD mutations. The spatial distribution (per gene) of such mutations is 
shown in Figure [SS] 

C.3 Mutual exclusivity groups of alterations. 

Croups of alterations showing a trend of mutual exclusivity were scanned with MUTEX and mu¬ 
tations and CNA hitting any of the 33 selected genes as input. MUTEX was run independently 
on MSS and MSI-HICH groups (Supplementary Table [S^ running times: approximately 6 and 3.5 
hours, respectively, on a standard Desktop machine). 

We selected only groups with score <0.2, where the score is derived from p-values corrected 
for false discovery rate. 3 groups are found for MSI-HICH tumors and 6 for MSS. For MSI-HICH 
tumors, the three predicted groups consists of genes agvrIb, acvr2a, tp53 and erbb2, of genes 
BRAF, NRAS and TGFBR2, and of genes KRAS and braf. 

Further groups of exclusive alterations were considered consistent with results reported in [^ . 
These include groups derived by consolidated knowledge of colorectal progression: the well-known 
WNT alterations in apg/gtnnbI [?], as well as RAS alterations in KRAS, NRAS and braf genes [74] . 
Similarly, we used also a group collected by scanning non-hypermutated tumors with the MEMO 
tool in [^ - this group includes pik3ca, pten, erbb2 and igf2 genes. These groups were restricted 
to account only for genes actually altered in a certain subtype, e.g., MSI-HICH tumors lack gtnnbI 
mutation, making the Wnt group irrelevant. Croups for MSS tumors are shown as Supplementary 
Figure |S9l groups for MSI-HICH tumors are in the Main Text. 
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C.4 CAPRI’s execution 


Background on the algorithm. CAPRI algorithm can be executed in two different modes, 
originally dubbed as “supervised” when formulas are given in input for testing, and “unsupervised” 
when this is not the case. This paper deals with the former; see the Main Text for the interpretation 
of formulas in this context and for a derivation of the algorithm. CAPRI is a three-steps 
procedure, which we briefly recall here. 

1. CAPRI starts by creating a “lifted” representation of the input data M which includes the 
input formulas; each formula - which is written in propositional logic - is so evaluated to 
yield a new column in the dataset. This is the input processed by the algorithm, which starts 
by selecting a set of candidate model edges, which are then used to constrain a score-based 
Bayesian model-selection problem [^ . 

2. The initial set of selective advantage relations 5, which determines the model edges x ^ 
is computed by evaluating the following inequalities 

(Temporal priority) p{x) > p{y) 

(Probability raising) p{y \ x) > p{y \ ^x ), 

where p(') is a marginal probability, p{' \ •) is a conditional, ^x is the negation of x and either 
X or ^ is not a formula. It is interesting to observe that probability raising implies that 
X and y are positively statistically dependent^ thus imposing a minimum threshold on their 
association [^ . 

In each of CAPRTs executions, the distribution of the observed marginals and conditionals are 
estimated by K non-parametric bootstrap resamples; practically, this means that we create a 
bootstrapped approximation for each of the four populations p(x), p(^), p{y \ x) andp(^ | ^x). 
Then, we use a single-tail non-parametric Mann-Withney U test of the difference in mean to 
test the hypothesis that one of the two populations is more probable than the other, e.g., 
p{x) > p{y). With this test, we can compute two p-values, one for each condition. An edge is 
included in S if at least the p-value for probability raising is below a significance threshold 
p^ and an edge is said not to be orientable if its p-value for temporal priority is above the 
same threshold. Cycles xi ^ X 2 ^ . x^ ^ xi that might appear in S are broken by deleting 

the edge with minimum p-value; both K and p^ are custom parameters. 

3. Optimization of 5, namely detection of the subset 5* of S with the edges that we include in 
the final progression model is done by optimizing, via hill climbing or tabu search^ the score 
with regularization 


5* = argmin |—21og[>C(5 | M)] -1- ^|5|| , (1) 

where £(•) is the model likelihood and M is the input data; the estimated optimal solution is 
5*, which is displayed as a Suppes-Bayes Causal Network. The different regularization strate¬ 
gies mentioned in the Main Text, BIC and AIC, are obtained by the following parametrization: 

(Bayesian Information Criterion) 0 = log(n) 

(Akaike Information Criterion) 0 = 2. 
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Besides edges, a model has a set of parameters d which define the conditional probability table 
of each edge and should be fit from data; these are necessary if one wishes to use a model as a 
“generator” of further data. For discrete-valued graphical models, for each parent set tTx 
parameter 0{y) = p{y \ tTx) can be taken either as the maximum likelihood estimate from the 
lifted input, or by using a Bayesian interpretation [^ . 

Usage in this context. CAPRI was run, on each group of tumors, by selecting alterations from 
the pool of 33 pathway genes; every alteration on a gene x is included if any of these apply: 

• the alteration frequency of x - sum of mutation and CNA frequency - is greater than 5%; 

• X it is part of an exclusivity group. 

The set of selected events for MSI-HIGH training tumors is shown in the Main Text, the analogous 
set for MSS tumors is shown as Supplementary Figure |STo| 

CAPRI was executed in its supervised mode by writing formula over groups and genes with 
multiple alterations associated, as explained in the Main Text. For instance, for MSI-HIGH tumors 
with alterations in RAS pathway we grouped hard exclusivity of NRAS mutations and deletions, with 
soft exclusivity of KRAS and braf mutations. Our aim was to account for a small subset of samples 
with concurrent KRAS and NRAS alterations (see Figure 2, Main Text). The list of all Boolean 
formulas written over groups is in Table |S3[ this approach was adopted also when a gene harbors 
multiple alterations in a subtype, e.g., erbb2 in MSS training samples which shows a trend of soft 
exclusivity between mutations and amplifications. We used both AIC and BIC scores to regularize 
inference after 100 non-parametric bootstrap iterations for estimation of the preliminary selective 
advantage relations - Mann-Whitney U test was performed with a minimum threshold p^ = 0.05. 
In most cases p-values are orders of magnitude below p^ - exact values reported as Dataset File SI. 
CAPRFs models with such p-values and non-parametric bootstrap confidence are shown in Figures 
m and uni statistical validation of the models is discussed in the next section. 

D Statistical validation of the models 

P-values from hypothesis testing, as well as scores from k-fo\d cross-validation and various boot¬ 
strapping techniques can be used to measure the statistical consistency of models and data, each one 
capturing different potential errors in the inference process. Approaches such as cross-validation 
and bootstrap are sometimes also used in the {ex novo) generation and inference of models from 
data (see, e.g., bootstrap consensus models [?]), but we only use them here for the a posteriori 
evaluation of a model’s confidence, and we interpret them as a quantitative measure for the relative 
assessment of each model’s relation. 

All the p-values and the scores that we present here are computed within TRONCO. 

D.l Edge p-values 

As explained in §C.4[ for each model edge x ^ y we get two p-values by assessing temporal priority 
and probability raising via Mann-Whitney U testing. 

For each edge, a p-value for the hypergeometric test of overlap between alteration profiles x and 
y can be computed. More precisely, we test if there is a difference between the number of samples 
containing both x and y versus the total population of samples with x, y, or both. We would like 
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the overlap to be significant as those samples - that determine the joint probability of x and y - 
are those supporting the presence of a selection trend among x and y. 

An edge is fully supported if all three p-values are below a custom significance threshold, e.g., 
0.05 or even better 0.01. Some edges might have the p-value for temporal priority above the 
threshold. If so, the selection trend might be still significant, but the temporal order of x and y - 
i.e., the direction of selection - is not supported by the data. 


D.2 Bootstrap 

We used non-parametric and statistical bootstrap techniques to measure the goodness-of-fit^ as 
originally proposed in [?]. In this case, we distinguish two type of errors that one could make in 
the inference process, estimating the presence or absence of edges in the model: 

• Type I errors: incorrect rejection of a true Hq (null-hypthesis), i.e., a ^false positive^’ edge 
that we wrongly include. 


• Type II errors: incorrect acceptance (failure to reject) of a false i.e., a ^false negative’’ 
edge that we miss. 


Non-parametric bootstrap computes scores to be interpreted here as follows. If a model 
contains an edge x ^ y that is a true positive, we expect its score to be high. In other words, when 
we sample with repetition subsets of the original data and re-run the inference process we expect to 
often find models which contain the edge x ^ y. Conversely, for a node y without incoming edges, 
or equivalently for any edge x ^ y which is correctly excluded from a model - a true negative - we 
would expect its bootstrap score to be low. However, this reasoning can be also generalized to whole 
models, where we count how many times we re-infer exactly the same model. Clearly, such scores 
will depend also on the empirical probabilities of the nodes in our data, and their deviation from 
the true probabilities of the phenomenon. So, one might expect rare events to be less frequently 
bootstrapped, which results in a lower estimate; however, such counts can be anyway interpreted 
as measures of repeatability of our finding^ 

The above bootstrap approach depends on two random number generators: one to shuffle data 
(the a posteriori bootstrap), and one to evaluate CAPRTs inequalities via hypothesis testing (the 
internal CAPRTs bootstrap, see C.4). Thus, to ensure that no bias is introduced by the random 


number generators, we performed a statistical bootstrap by holding data fixed, and re-estimating 
CAPRTs inequalities with generators initialized with different seeds. We evaluated the robustness 
of our scores for all edges imputed to be genuine and hence, high-scoring. 

Notice that, in principle, even parametric bootstrap scores could be computed if we used the 
model to generate bootstrapped data [^. However, as the support of the distribution subsumed 
by a model with n nodes consists of 2^ possible outcomes, sampling uniformly from large models 
might be computationally hard. For this reason, and because such scores are overestimates of the 
non-parametric ones, we did not include them in our computation. 

The MSS and MSI progression models are annotated with the non-parametric bootstrap scores 
in Figures [STT] and |ST^ Non-parametric and statistical bootstrap scores for a set of selected edges 


®Bootstrapping techniques have been widely used to gauge uncertainty in estimates, but also subjects of philo¬ 
sophical debate about their precise interpretation, especially when coupled with various significance thresholds - 
unlike the situation with a p-value for a null hypothesis. An exhaustive review on the topic is provided by Soltis 
in [?]. We follow the ideas originally developed in the area of phylogenetic analysis [?], suggesting that the scores 
can be alternatively interpreted as a measure of accuracy of the method or of the robustness of the data [?]. 
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are shown and commented in Figure S13 For the same set of edges, we also report the p-values 


for CAPRFs inequalities assessed in the MSS and MSI progression models (temporal priority and 
probability raising, as a measure of the selectivity among the alterations, and hypergeometric, as a 
measure of the randomness in the overlap of two alteration profiles). Additional comments are in 
the caption. 


D.3 Cross-validation 

Next we study the sufficiency of the data sizes for model inference and its ability to characterize the 
underlying progression {goodness-of-data). Thus, we focus on the Type III errors, which occur when 
the sample size is inadequate or the sample is a poor descriptor for the reference phenomenor]^ 
and thus failing to represent the progression. For this purpose, we used cross-validation with the 
data used for the models built (see the Main Text), and followed the best practices developed by 
the Bayesian Networks community [^ . 

TRONCO exploits the cross-validation routines implemented in the bnlearn package [?]. The 
approach that we adopt is a /c-fold non-exhaustive cross-validation, which we repeat 10 times to 
average its results. Exhaustive strategies might be used for datasets of small sample size. Each run 
of cross-validation consists in computing a loss function for a model; its steps are the followings: 

• split randomly the data in /c = 10 groups, and then repeat the following two steps, for each 
group in turn: 

1. set one of the groups to be the “training” M^r; 

2. merge the others /c — 1 to be the “test” M^e; 

3. by holding fixed the model structure (i.e., the edges in 5*), fit the model parameters 6 
over the training data via maximum likelihood estimates^ compute a score over the test 
Mte (see below) and the corresponding loss; 

• combine the k loss estimates to give an overall loss for data. 

Let Otr be the parameters fit from the training set M^r, and 5* the edges in the model, three 
scores are computed with cross-validation: 

1 . the negative entropy of a model - i.e., the negated expected log-likelihood of the test set for 
the Bayesian network fitted from the training set, that is 

eloss(5*,0tr) = -nC{S\etr I M,e)] • 

2. the prediction error for a single node x and its parents set X, i.e., we measure how precisely 
we can predict the values of x by using only the information present in its local distribution 
6tr{x). This parameter corresponds to computing the misclassification rates from Pte{x)^ the 
empirical marginal probability of x estimated from the test. 

3. the posterior classification error for a single node x and one of its parent node y e X - i.e., 
the values of x are predicted using only the information present in y by likelihood weighting 
and Bayesian posterior estimates. 

® Consider the case where the samples are from two different unknown and heterogeneous groups, with random 
chance making low values to be sampled from a group that actually has a majority of high values, and vice versa. 
In this case, the samples will not be the best descriptor of the two groups. 
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See [?] for a discussion on these loss functions. The first statistics measures the log-likelihood loss 
when we “forget” some of the samples used to infer a model (indeed, the test samples); see Figure 
|S14| Roughly, we are measuring how the model’s predictive power changes as we look at the data 
from different viewpoints. This is a score for a whole model, it has no scale - so cannot be used to 
say how good the models/data are, in any absolute sense. Nonetheless, it can be used to evaluate 
how “stable” a model is, for a certain dataset. 

The second and third statistics measure the accuracy of the parent-set, X, for a child x; the 
second statistics dealing with the whole parent set as predictor, and the third, the individual 
contribution of each of the parents. For these two statistics, we desire the prediction error to be 
low, as a measure of goodness. These are shown in Figure [ST^ (selected edges), S15 and |S16| (all 
edges, prediction error). 


E Supplementary Tables and Figures 


Datasets (CNAs and mutations provided by TCGA) 




statistics 


alteration type 


cancer^ 

n 

m 

|G| 

mutations 

amplifications 

deletions 

MSI-HIGH 

27 

16100 

13798 

11556 

2888 

1656 

MSS 

152 

21317 

16371 

12417 

6925 

1975 


^ Samples were classified as MSI-HIGH/LOW and MSS by TCGA; see fiag 
MSI_status in clinical data available for the COADREAD project. 


Table SI: COADREAD Data. Data used in this study, derived from the TCGA COADREAD 
project [^ . 
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MUTEX parameters 


Parameter 

Value 

Description 

signalling-network 

- 

MUTEX network^ 

max-group-size 

5 

maximum size of a result group 

first-level-random-iteration 

10000 

number of randomisation to estimate null distribu¬ 
tion of member p-values in groups 

second-level-random-iteration 

100 

number of runs to estimate the null distribution of 
final seores 

fdr-cutoff 


false-diseovery-rate eutoff maximising the expeeted 
value of true positives - false positives is estimated 
from data 

search-on-signaling-network 

TRUE 

reduee the seareh spaee using the signalling network 

^ Manually curated from Pathway Commons, SPIKE and SignaLink databases. Provided with the tool; available 
for download at https://code.google.eom/p/mutex/. 


MUTEX groups with score < .2 



MSI-HIGH Groups 

seore 

q-value 

1 

KRAS, BRAF, 

0.095 

0.48 

2 

NRAS, BRAF, TGFBRl 

0.1677 

0.45 

3 

erbb2, tp53, acvrIb, acvr2a 

0.1703 

0.355 


MSS Groups 

seore 

q-value 

1 

tp53, atm. 

0.051 

0.34 

2 

ARIDlA, TP53 

0.075 

0.193 

3 

KRAS, NRAS, BRAF, 

0.0864 

0.1975 

4 

CTNNBl, APC, DKK2, 

0.098 

0.144 

5 

DKKl, tp53, atm, dkk2 

0.1387 

0.176 

6 

pik3ca, tp53, atm 

0.164 

0.207 


Table S2: MUTEX: parameters and results. Top: Parameters used to run MUTEX on the 
original TCGA MSS/MSI-HIGH datasets with input GNA and somatic mutations in the pathway 
genes described in text. Bottom: MUTEX identified 3 and 6 groups of alterations showing a trend 
of mutual exclusivity in these groups with score below the suggested cutoff of 0.2. 
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Formulas input for testing to CAPRI^ 



MSI-HIGH tumors 

description 

1 

(NRASim 0 NRASid) V KRASim V BRAFim 

RAF exclusivity 

2 

PlK3CA:m V ERBB2:m V PTENim V lGF2:d 

MEMO group 

3 

(ACVRlBim 0 ACVRlBid) V ACVR2A:m V TP53:m V ERBB2:m 

MUTEX group 

4 

(NRASim 0 NRASid) V TGFBRlim V BRAFim 

MUTEX group 

5 

KRASim V BRAFim 

MUTEX group 

6 

AGVRlBim 0 AGVRlBia 

multiple alterations 

7 

NRAS:m © NRAS:a 

multiple alterations 

8 

FBXwTim V FBXwTia 

multiple alterations ^ 


MSS tumors 

description 

1 

(APGim 0 APGid) V GTNNBlim 

WNT exclusivity 

2 

(KRASim V KRASia) V (NRASim 0 NRASia) V (BRAFim 0 BRAFia) 

RAF exclusivity and MEMO group 

3 

PIKdGAim V (ERBB2im V ERBB2ia) V (PTENim 0 PTENid) V IGF2ia 

MEMO group 

4 

(TP53im 0 TP53id) V (ATMim 0 ATMid) 

MUTEX group 

5 

(TP53:m 0 TP53id) V ARlDlAim 

MUTEX group 

6 

(TP53:m 0 TP53:d) V ARlDlAim 

MUTEX group 

7 

(APGim 0 APGid) V GTNNBlim V DKK2im 

MUTEX group 

8 

(TP53im 0 TP53:d) V (ATMim 0 ATMid) V DKK2im V DKKlim 

MUTEX group 

9 

(TP53:m 0 TP53:d) V (ATMim 0 ATMid) V PlK3GAim 

MUTEX group 

10 

(APGim 0 APGid) 

multiple alterations 

11 

(TP53:m 0 TP53:d) 

multiple alterations 

12 

(SMADdim 0 SMADdid) 

multiple alterations 

13 

(TGF7L2im 0 TGF7L2id) 

multiple alterations 

14 

(ATMim 0 ATMid) 

multiple alterations 

15 

(NRASim 0 NRASid) 

multiple alterations 

16 

(ERBB2im V ERBB2ia) 

multiple alterations 

17 

(PTENim 0 PTENid) 

multiple alterations 

18 

(SMAD2im 0 SMAD2ia) 

multiple alterations 

19 

(DKKdim 0 DKKdia) 

multiple alterations 

20 

(sox9:m 0 SOx9:d) 

multiple alterations 

21 

(BRAFim 0 BRAFia) 

multiple alterations 


Events type: mutation (m), deletion (d), amplification (a). Hard (0) and soft (V) exclusivity. 

^ Formula not included as it creates a duplicated signature in the dataset. 

Table S3: CAPRI formulas from exclusivit^ groups. Formulas created for the groups, and 
input to CAPRI for testing. These are either derived from exclusivity groups or from genes involved 
in different types of alterations. 



Focal Copy Number Alterations 




THE 

CANCER 

GENOME 

ATLAS 


» 


Somatic mutations 
fl 224 ^15995 


12^ 

^ HH-gfW wnwiiffB 

High-level Gain Homozygous Deletion 
0 564 ^10184 O 564 8174 


COADREAD 2012 

https://tcga-data.nci.nih.gov/docs/pubHcations/coadreacl_2012 

f available ^ number of 

information ^ events (m) 

§■ pSoT”' S 

★ Executed by the Consortium 


samples with both mutations and CNA data available ^ 


n!" ii 

TRONCO 

PiCnIc 

Pipeline for Cancer Inference 

ttii# o’ fit 



RECONSTRUCTION 

progression model Data 

Formulas 


CAPRI - TRONCO 



VALIDATION 


Wilcoxon 

p-values 


:>C 


Cross-validation III | Bootstrap 

errors, entropy riTl nonpar, stat 


Literature 




Figure S6: PicNic pipeline processing MSI-HIGH/MSS tumors. We process with PicNic 
Microsatellite Stable and highly unstable tumors collected from the The Cancer Genome Atlas 
project “Human Colon and Rectal Cancer” (subtypes annotations provided as clinical data). We 
implement a study on selected somatic mutations and focal CNAs in 33 driver genes manually 
annotated with 5 pathways in the COADREAD project. We scan groups of exclusive alterations 
with computational tools run by us and by TCGA, and we exploit previous knowledge on CRG; we 
select which alterations we input to CAPRI. Ne^, inference is performed with various settings of 
regularization and confidence. Statistical confidence of the models is assessed with standard tech¬ 
niques from the literature (p-values from statistical testing, bootstrap scores and cross-validation 
statistics). 
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Figure S7: Mutations annotated by TCGA for the driver genes. Top: the majority of 
the mutations annotated by TCGA for the driver genes that we consider are missense - this in 
almost all genes and in all the cohort. Bottom: overall summation of the frequencies determine the 
mutations across all driver genes. 
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• Missense • Truncating • Inframe deletion/insertion • Other 


iiS^cBioPortal 

- ' for Cancer Genomics 


Figure S8: Lolliplot diagrams of TCGA mutations. Diagrams generated from the cBio portal 
for the COADREAD project (see http://www.cbioportal.org/). These display the physical distri¬ 
bution of the annotated mutations for each gene. Here we shown only genes with a total mutation 
count greater than 15; fam123b is called with its synonym amerI, as in the portal. 
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Exclusivity groups for MSS tumors 


Knowledge-based priors: WNT exclusivity 
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Knowledge-based priors: RAF exclusivity 
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Computational priors: MEMO group 

I iiiiHi: II III I I III I III! mil II I I 11 I I iiiii II nil II mil i ii imii m iii ii iiriiii i ii i m i ii ii i i stage 


mmimmiiiimim 

15% PIK3CA 

1 .. 1 .; 

" mm . 

5% ERBB2 

1 

Ill mi 

5% ERBB2 

III 

1 1 

3% PTEN 


mil 

■ ■ 3% PTEN 

1 

.. „| 

3%IGF2 


Computational priors: Mutex groups with score < 0.2 
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Figure S9: Groups of exclusive alterations for MSS tumors. Knowledge-based groups of 
exclusive alterations consist of: KRAS, NRAS and BRAF genes (raf pathway) and APC and ctnnbI 
genes (wnt pathway). The MEMO group identified in in this cohort consists of genes 
pikSca, erbb2, igf2 and pten. Finally, 6 groups are predicted by MUTEX with score below 
.2, one of these is equivalent to the known exclusive alterations in RAF pathway. 
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Figure SIO: Selected data for MSS tumors. Colorectal tumors with Microsatellite Stable 
clinical status in the TCGA COADREAD project, restricted to 152 samples with both somatic 
mutations and CNA data available. 33 driver genes annotated with 5 pathways are selected from 
the list published in to automatically detect groups of mutually exclusive alterations. Events 
selected for reconstruction are those involving genes altered in at least 5% of the cases, or part of 
group of alterations showing an exclusivity trend (see EigurejS^. This dataset is used to infer the 
set of selective advantage relations which constitute the MSS progression model presented in the 
Main Text. 
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♦ Non Parametric Bootstrap (score) TCGA MSS colorectal tumors 

♦ Temporal Priority (p-value) 

♦ Probability Raising (p-value) 

Colors (p-value) 



Figure Sll: Non-parametric bootstrap scores for MSS progression. Progression model for 
MSS tumors with confidence shown as edge labels. The first label represents the relation confidence 
estimated with 100 non-parametric bootstrap iterations, the second and third are p-values for 
temporal priority and probability raising. Red p-values are above the minimum significane threshold 
of .05. See Figure 4 in the Main Text for an interpretation of this model. 
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TCGA MSI-HIGH colorectal tumors 



Figure S12: Non-parametric bootstrap scores for MSI-HIGH progression. Progression 
model for MSI-HIGH tumors with confidence shown as edge labels. The first label represents the 
relation confidence estimated with 100 non-parametric bootstrap iterations, the second and third 
are p-values for temporal priority and probability raising. Red p-values are above the minimum 
significane threshold of .05. See Figure 5 in the Main Text for an interpretation of this model. 
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Figure S13: COADREAD statistics for models confidence. For BIC models we show statistics 
for edges with non-parametric bootstrap score approximately greater than 30%, for AIC models 
those greater than 50%s. i) p-values (100 repetition of non-parametric bootstrap, prior to Wilcoxon 
testing) for each edge statistics of selective advantage {direction and statistical dependence, and 
hypergeometric). In general, the edges that we selected show very strong support {p 10“^^), 
but for those edges connecting events with the same marginal frequencies, where we can not be 
confident in the edge direction {p > 0.05) but still we find strong statistical dependence, {ii) 
A posteriori model confidence against Type I and II errors estimated with non-parametric and 
statistical bootstraps (100 repetitions) - edges annotated in Figures Sll and S12| {Hi) Values of 
posterior classification and prediction errors are estimated from 10 repetitions of 10-fold cross- 
validation. The former reports how much errodls due to predicting, for each set of edges X = 
{xi, ..., Xn} y, the value of y according to the value of each Xi G X. The latter reports the same 
statistics when we predict y from the whole set of parents X. 
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Figure S14: Entropy loss for MSI-HIGH/MSS models. Violin plot computed from 10 runs 
of k-fo\d cross-validation with /c = 10, where we compute the “loss of log-likelihood” at each fold. 
In the plot and in the table we report also the overall log-likelihood, as well as the BIC and AIC 
scores for the models. We present the ratio of log-likelihood loss as a measure of stability of these 
models for these two datasets - we can observe that the MSS models lose < 1% of their likelihood, 
while the MSI lose slightly more (still, < 4%), possibly because of the smaller sample size. From 
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MSI Prediction Error (BIC parent set, k-fold cross-validation) 
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MSI Prediction Error (AlC parent set, k-fold cross-validation) 
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Figure S16: Prediction error for each parent set of BIC and AIC models of MSI-HIGH 
tumors. 


50 



















































(biomarker) 



(drug resistance) 



(metastatic) 



(anti-hallmark) 



< 

I— 

< 

o 


X Y Z .. 


1 1 

6 ... 

1 1 

0 ... 

1 1 

0 ... 

1 0 

1 ... 

1 0 

1 ... 

1 0 

1 ... 

1 0 

0 ... 

1 0 

0 ... 

1 0 

0 ... 


samples with 
the X/Y genotype 


samples with 
the X/Z genotype 


samples with 
the X genotype 


the genotype X/Y/Z 
that we never 
observe could be 

on anti-hallmark 


Figure S17: Models and the phenotype that they might explain, (biomarker) Indepen¬ 
dent evolutionary trajectories depicted by a model might share common routes through a certain 
alteration Y; that could point to a new biomarker harbored by most of the tumors under study, 
(drug resistance) When a progression model branches in many independent sub-progressions, 
each one identified by alterations X, Y and Z, if a certain drug is known to target only a certain 
type of such clones (e.g., those where biomarker X is present), we might get insights on which are 
the biomarkers which make the drug ineffective for certain patients (e.g., those were cancer evolves 
through Y and Z). (metastatic) When a model is extracted from data representative of various 
tumor stages, we might discover which “late events” are those conferring a metastatic phenotype 
to a tumor - X in the figures, (anti-hallmarks) Relation between anti-hallmarks and formulas. 
Exclusivity formulas allow to capture fitness-equivalent events (Y and Z in the figure), and the 
presence of alternative routes - here those identified by the genotypes X/Y or X/Z. These could 
point us to genotype/phenotype that we do no observe in our cohort - here the X/Y/Z - which 
could be exploited for targeted therapy if a synthetic lethality is screened among Y and Z, the 
anti-hallmark. 
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